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Abstract 

Identifying and understanding modular organizations is cen- 
trally important in the study of complex systems. Sev- 
eral approaches to this problem have been advanced, many 
framed in information-theoretic terms. Our treatment starts 
from the complementary point of view of statistical model- 
ing and prediction of dynamical systems. It is known that 
for finite amounts of training data, simpler models can have 
greater predictive power than more complex ones. We use the 
trade-off between model simplicity and predictive accuracy 
to generate optimal multiscale decompositions of dynami- 
cal networks into weakly-coupled, simple modules. State- 
dependent and causal versions of our method are also pro- 
posed. 



Introduction 

The study of complex dynamical systems - such as gene 



regulatory networks (Han et al. 2004), structural and func- 


tional brain networks ( Bullmore and 


Sporns 2009|l, ecolog- 


ical food webs (Krause et al.||2003 


!, and others (|Hartwell| 


|et al.||1999||Schlosser and Wagner 


2004;) - has frequently 



uncovered the presence of modularity. Broadly speaking, 
modular systems are composed of tightly-integrated subsys- 
tems, called modules, which are in turn weakly coupled to 
one another. 

Numerous explanations have been proposed for the func- 
tion of modularity in complex systems, only a few of which 
are mentioned here. Simon ( 1962 ) suggested that modular- 
ity can contain the effects of harmful perturbations and lead 
to greater developmental and operational robustness, espe- 



cially when modules are hierarchically arranged. Kashtan 



and Alon ( 2005 \ argued that modular systems can take ad- 



vantage of reusability when adapting to changing combi- 
nations of fixed environmental tasks. |Tononi et al.| (j 1998 > 
proposed that modularity balances the conflicting needs for 
subsystems that are functionally specialized but also inte- 
grated into globally coherent states. Notably, it has also been 
shown to arise as a result of non-adaptive processes, such as 



neutral evolution of gene regulatory networks ( Force et al 
|2005[ |Sole and Valv erde 2008 ) and stochastic fluctuations 
in network connectivity patterns ([Guimera et al. , 2004). 



Though the concept of modularity has acquired a central 
place in the study of complex systems, its meaning and op- 
erationalization varies widely between scientific paradigms, 
fields, and processes of interest. In the biological sciences 
alone, one can find references to structural, developmen- 
tal, physiological, variational, and functional modularity 



( Winther 2001 Wagner et al. 2007 1, among others. In this 
work, we propose a formal notion of modularity based on 
statistical modeling. Our approach applies to a broad class 
of discrete-time multivariate dynamics, whether represented 
by dynamic models, such as Boolean or dynamic Bayesian 
networks, or empirical distributions estimated from time se- 
ries recordings. Unlike much recent work on community- 
structure in static graphs, we identify modularity in the or- 
ganization of dynamically interacting components. We ar- 
gue that in addition to being useful for analysis of real-life 
dynamical systems, our approach can shed light on connec- 
tions between notions of modularity utilized in different do- 
mains, as well as the general role of modularity in modeling. 

The next section provides a brief background on infor- 
mation theory. We then outline traditional information- 
theoretic approaches to modularity in dynamical systems, 
and develop our own treatment in terms of statistical mod- 
eling. After applying it to an example dynamical system, 
we consider state-dependent and causal versions of modular 
decompositions. We conclude by discussing issues of pa- 
rameterization, directions for further work, and connections 
between our method and broader questions of modeling. 

Information theory 

Information theory provides principled measures of infor- 
mation transfer and statistical dependence in distributed sys- 
tems. As such, it is well-suited for quantifying measures of 
coupling and modularity. 

To review, Shannon entropy measures the uncertainty in 
the measurement outcomes of a random variable. If X is a 
discrete random variable with an associated probability dis- 
tribution P(X), then its entropy is: 

H(X) = -J2 P(x)\ogP(x) 

x£X 



A random variable that takes a single value with probabil- 
ity 1 has an entropy of 0, while an equiprobable random 
variable assumes the maximum entropy of log \X\, where 
| JSC | is the number of possible outcomes. When the base of 
the logarithm is 2, as in this work, the units of entropy are 
bits ( 1 bit is the uncertainty in the choice between 2 equally 
possible outcomes). Because measuring a variable reduces 
uncertainty about its value, entropy can also be considered a 
measure of information. 

When provided with a joint distribution over two random 
variables such as P(X,Y), conditional entropy measures 
the expected uncertainty in the value of one variable given 
that the value of the other is known: 

H(X\Y) = H(X, Y) - H(Y) = - ]T P(x, y) log P(x\y) 

x,y 

Mutual information is a symmetric measure of nonlinear 
correlation between two random variables. Expressed as the 
difference between entropy and conditional entropy, it can 
be interpreted as the reduction in uncertainty about the value 
of one random variable provided by knowledge of the other: 



I(X;Y) 



H(X)+H{Y)-H(Y,X) 
H(X) - H(X\Y) = H(Y) - H(Y\X) 
P{x,y) 



P(x, y) log 



P(x)P(y) 



Mutual information captures the amount of constraint in 
the joint distribution of two variables not present in their 
marginal distributions. It is equal to when two variables 
are statistically independent, and reaches its maximum pos- 
sible value of mm{H(X), H(Y)} when one variable is a 
function of the other. 

Mutual information can be extended to the case 
of more than two variables. Let random vector 
X=(Xl, X2, ■ ■ • , X£) with distribution P(X) represent the 
state of a system composed of L distinct variables. The total 
constraint in this system not present in any single variable 
is measured by a multivariate version of mutual informa- 



tion, often called multi-information (Studeny and Vejnarova 



1998) or integration ( |Tononi et al)|1994[ >: 



I(X) = ^tf(JQ)-if(X) 



(1) 



= ]Tp(x) log 



P(x) 



Kullback-Leibler (KL) divergence is a measure of the dif- 
ference between two distributions: 



KL(P||Q) = ^P( 2 ;)log 



Q(x) 



(2) 



distance because it is not symmetric. Importantly, many 
information-theoretic measures can be restated in terms of 
KL divergence. For example, the multi-information of eq. [T] 
is equal to the KL divergence between the distribution of X 
and a product of the marginal distributions over the individ- 
ual variables of X. 

Modularity in multivariate dynamics 

As previously mentioned, multi-information measures the 
total amount of higher-order constraint present among the 
variables of a multivariate system. It is when these vari- 
ables are independent, and increases when more statistical 



interaction between variables is present (Studeny and Vej- 



narova 



1998 ). For this reason, many formal approaches to 



modularity search for system transformations that minimize 
this measure. 

Several kinds of transformations can be investigated. In 
dependent component analysis attempts to minimize multi 
information over the space of linear mappings (coordi 
nate changes) of a multivariate system (Hyvarinen and Oja 



2000). A different approach, closer to the one pursued 
here, looks for partitions of system variables with low multi- 
information. 

A partition tt of set S is a set of mutually exclu- 
sive, nonempty subsets B C S, called blocks, such that 
\J B e, B = S. For example, {{1},{2,3}} and {{1,2,3}} 
are two possible partitions of the set {1, 2, 3}. We also use 
a more concise notation: the two partitions above, for ex- 
ample, can be referred to as 1/23 and 123 respectively. Ad- 
ditionally, ttq is used to indicate the total partition, which 
includes the entire set in a single block, i.e. ttq = {S}. 

We look at partitions of V = {1, . . . , L}, the set of in- 
dexes of the variables of random vector X. For partition tt 
and block B G tt, P(Xb) indicates the marginalization of 
P(X) onto the variables whose indexes are in B. For exam- 
ple, P(X{! 2}) is the marginal distribution of the first two 
variables of X. 

We define the multi-information of partition tt as: 

Z*(X) = ]T H(X B ) - P(X) 

BEtt 

This measure quantifies the amount of constraint holding 
among the blocks of tt. Finding partitions with low multi- 
information corresponds to identifying weakly-coupled sub- 
systems. Variations on this theme appear in information- 
theoretic treatments of modularity starting from early cyber- 



It is always positive and iff P = Q, though it is not a 



netics (Conant 1972) to more recent approaches in compu- 
tational neuroscience ( |Tononi and Sporns]|2003| ). 

Multi-information is defined over a time-invariant distri- 
bution of system states. Though it does not account for the 
dynamic flow of information within a system, it can be gen- 
eralized to this case. Assume a multivariate system with 
Markovian dynamics represented by P(X' = x'|X = x), 
the conditional probability distribution of transitioning to 
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Figure 1: A simple four node Boolean network (nodes 1, 2, 
3, and 4 perform OR, AND, majority, and OR update func- 
tions respectively). Its full state transition table is shown in 
center. On the right, the stochastic interaction of every pos- 
sible partition of the network. 

each future state x' given starting state x, as well as 
P(X = x), the distribution over starting states^ The 
amount of information flowing dynamically among the 



blocks of 7r is called stochastic interaction (Ay and Wen- 
2003) >. It is a conditional version of KL divergence 



nekers 



between the transition distribution of the whole system and 
the product of marginal transition distributions of the vari- 
able blocks specified by partition it: 



Z*(X'|X) 



= KL 



H(X! B \X B )-H(X!\X) (3) 



P(X'|X) 



These kinds of dynamic generalizations of multi- 
information have recently been proposed as measures 



of system-wide coupling in brain dynamics (Balduzzi and 
|Tononi| [20081 |Barrett et al.|[20TT] |. 

A simple demonstration is provided by the Boolean net- 
work in fig. [T] It has four nodes, whose update functions are 
OR, AND, majority rule, and OR respectively. The stochas- 
tic interaction of each possible partition is provided, assum- 
ing a uniform distribution over starting states. For exam- 
ple, the partition 12/34 is the bi-partition having the lowest 
stochastic interaction: the block {1,2} has conditional en- 
tropy P^X'^ 2 }|X{i ,2}) = (nodes 1 and 2 do not depend 
on the rest of the system, so their marginalized dynamics 
are deterministic), while block {3,4} has conditional en- 
tropy H (X'| 3 4 j |X{ 3 4 }) = 0.5. Because the system as a 



'We assume that the dynamics are stationary, in that the tran- 
sition probability distribution does not change through time. Our 
analysis can also be applied to higher-order Markovian systems, 
though for simplicity they are not considered here. 



whole is deterministic, P(X'|X) = and the total stochas- 
tic interaction of partition 12/34 is H (X^ 2 j |X{i .2}) + 

P(X' |X {M} )-P(X'|X) = 0.5. 

Unfortunately, stochastic interaction is not a suitable cost 
function for identifying modular partitions of a multivari- 
ate dynamical system (similarly for multi-information and 
multivariate non-dynamical systems). In any such system, 
a minimal stochastic interaction of will be assigned to the 
total partition ttq, and generally a partition will never have 
a greater stochastic interaction than any of its refinements 
(where one partition is a refinement of another if every block 
of the former is a subset of some block of the latter). Select- 
ing partitions using stochastic interaction will thus favor par- 
titions with large blocks, the total partition being a (possibly 
non-unique) global minimum. 

Due to this, several authors have proposed normalizing 
factors that penalize large partitions ( Conant 1972||Balduzzi| 
|and Tononij|2008| l. However, the derivation and justification 
of these normalizing terms is ad hoc. In this work, we ap- 
proach the problem of identifying modules from the point 
of view of statistical prediction. This yields principled pe- 
nalization terms for large partitions and leads us to uncover 
modular decompositions with clear interpretations in terms 
of statistical modeling. 

Statistical modeling and modular 
decompositions 

Information theory is intimately connected with statistical 
modeling (Rissanen 2007]). For example, assume a model 
that assigns a probability value to data x: 



Q(x) = / Q{-x.\6)lo{0)&0 
Je 



(4) 



This term, called the marginal likelihood in the Bayesian 
literature, is the expectation of the likehood function Q(x\8) 
with respect to distribution uj(0) over parameter values. 

Q(x) is a measure of predictive fit to data, and its log- 
arithm is often maximized over parameter distributions or 
model choices. Equivalently, one can minimize the negative 
of its logarithm, a measure of predictive error called log loss. 
If data samples are drawn from some true probability distri- 
bution P(X = x), then the expectation of the log loss of the 
marginal likelihood is: 

- ]T P(x) log Q(x) = KL(P||Q) + H (P(X)) 
xex 

The KL term (from eq. |2]i is non-negative, and reaches its 
minimum of when the model is perfectly fit, i.e. Q = 
P. It is a measure of excess prediction error of the model 
above the minimum possible. This minimum is specified by 
the entropy term, and depends only on the true distribution 
P(X) and not on model or parameter choices. 

A similar situation holds in the dynamic setting. We call 



dynamic models those that generate conditional distributions 
of multivariate future states x' given starting states x: 

Q(x'|x) = f Q(x'|x,0)w(0)d0 
Je 

We look at statistical prediction of dynamical systems from 
the perspective of an agent who does not possess a perfectly 
fit model, but must learn a dynamic model given previous 
observations. The agent is provided with a set of factorized 
models: for each partition of system variables n, there is 
a dynamic model Q„ whose parameters and marginal like- 
lihood obey the independence conditions imposed by the 
block structure of ir: 



Q 7r (x'|x) = Q^x'bIxb) 



(5) 



B£tt 



The predictive performance of our agent depends on the 
chosen model and the amount of previously observed data. 
It can be quantified with a risk function, which here is the 
KL divergence between the true distribution P(X'|X) and 
the distribution predicted by a dynamic model (Haussler and 



Opper 1997 1. The risk of model on the next sample, 



after observing N previous samples, is: 
r N , Q „ = KL[P(X'|X)|[Q )r (X / |X J X ,1 " JV ,X 1 - JV )] 



(6) 



The expectation in the KL term is taken over the next sample 
of X', X, as well as N previous i.i.d. samples X' 1 , X 1JV . 
The Bayesian posterior predictive distribution: 

Q T (x'|x,x' 1 - Ar ,x 1 - Ar )=|g^x'|x,0)Q 7r (0|x' 1 - Ar ,x 1 - Ar )d0 

is the marginal likelihood of eq. [4] with the distribution over 
parameter values conditioned on N previous data samples. 
From the point of view of machine learning, such Bayesian 
updating of parameters in light of observed data corresponds 
to model training, while evaluating the expected model risk 
on new samples corresponds to model testing. More con- 
cretely, our dynamic models can be considered supervised 
learners: given data, they infer probabilistic mappings from 
inputs (starting states X) to outputs (future states X')- 

Given the independence assumption of eq. [5] risk r^.Q,, 
becomes: 

UX' |X)+£ KL[P(X' B |X fl ) || Q^Xy X B ,X't N ) X 1 B - N )] 
Be-ir 

This form draws attention to the two components that con- 
tribute to risk (that is, predictive error). The stochastic inter- 
action term (see also eq. [3]l arises as a consequence of ignor- 
ing dynamic coupling between variables in different blocks. 
It is the minimal excess error of a factorized model (in which 
the dynamics of the variable blocks induced by partition ir 
are independent) above an optimally fit whole-system model 
(where interactions between all variables can be captured). 



The second term, called the complexity term, reflects the 
excess predictive error of a trained model above the min- 
imum possible. It arises because a model trained on a fi- 
nite amount of data maintains some uncertainty about opti- 
mal parameter values. For a given amount of training data, 
complex models (with larger parameter spaces) will have 
greater parameter uncertainty than simpler models, resulting 
in more excess predictive error. As N — > oo, the complexity 
term can be asymptotically approximated by 4fk, where d n 
refers to the number of para meters of model Q 
1996 Barron and Hengartner 1998 1. This yields 



7T ( 



Komaki 



rN,Q„ ~ Z^X'IX) 



2N 



(7) 



For a given amount of training data N, the model with the 
lowest risk, 

Q*(N) = argminrjv.Q, 

corresponds to the partition providing an optimal predictive 
decomposition of the system. Models that minimize risk of- 
fer a balance between two conflicting constraints: on one 
hand, low stochastic interaction (better predictions under op- 
timal fit), on the other, low model complexity (easier param- 
eter estimation with limited training data). Because parti- 
tions with smaller blocks (which have smaller-state-space 
dynamics representable by fewer parameters) generally in- 
duce simpler models, risk presents a principled cost function 
for identifying small, weakly-coupled modules. The amount 
of data N parameterizes this trade-off: as N increases, em- 
phasis is shifted from the complexity term to the stochastic 
interaction term, and groups of variables whose dynamic in- 
teractions carry the most information while being easiest to 
learn are first to coalesce into multivariate blocks of the op- 
timal modelj^Thus, selecting optimal decompositions while 
increasing the amount of training data generates a modular 
multiscale decomposition of system variables. In the infinite 
data limit, the risk of each model reaches its minimum of 
Z,r(X'|X), and the partition corresponding to Q* becomes 
the one with lowest stochastic integration (the total partition 
being a possibly non-unique minimum). 

Decomposing a dynamical system 

The complexity term in eq. [7] depends on the parametric 
form of the dynamic model. Though a variety of possibili- 



This approximation assumes continuously-parameterized 
models and standard regularity conditions. It also assumes that, 
for all 7r, some parameterization of offers a perfect fit to the 
factorized n_Be7r-P(X' s jX.g). It is possible to generalize beyond 
this case, where the factorizations of the true distribution are 'out- 
of-class' of the models Q^. 

3 Minimizing risk can be seen as a form of information bottle- 
neck (Tishby et al.||1999| >: it searches for factorized models whose 
parameters minimize information about training data while maxi- 
mizing information about system dynamics; the size of the training 
data serves as a trade-off parameter. 




Figure 2: Top: approximate risk for optimally-predictive 
models of the Boolean network from fig. [T] Dots mark 
switches of the optimal model Q*; inset shows first two 
switches. Bottom: cumulative risk, or total accumulated 
prediction error for models plotted in the top graph. Total 
modularity (T) is asymptotic difference between cumulative 
risks of Q1234 and Q* or, alternatively, area between lines 
corresponding to (non-cumulative) risks of Q1234 and Q*. 



ties exist, here our dynamic models are assumed to be prod- 
ucts of first-order Markov chains with Dirichlet priors. The 
number of parameters of model from this class is: 



^ = ^1X^1(1X^1-1) 



(8) 



Ben 



where \X.g\ is the number of supported starting state out- 
comes and \X.' B \ is the number of possible future state out- 
comes of the variables with indexes in block B. For ex- 
ample, for a single block of Boolean variables with a fully 
supported starting state distribution, these are both equal to 
2' B I. For this model class, the complexity term scales expo- 
nentially with the number of variables in each block. 

As an example, we look at optimal decompositions of the 
network in fig. [T] Its risk, calculated using the approxima- 
tion of eq. [7] and parameter counts of eq. [8] is shown at the 



top of fig. 2 4 The risk is plotted for those models which 



reach minimum risk at some point of the training process, 
as well as that of the overall minimal risk model Q* at each 
N. Predictive power is initially optimized by the model cor- 
responding to partition 1/2/3/4 (the simplest model which 
treats all nodes independently). At N m 3 (inset), it is re- 
placed by the model corresponding to partition 12/3/4 (vari- 
ables 1 and 2 now merged into a single block); at N » 4 
(inset), by the model corresponding to partition 12/34; and 
finally at N w 215, the most predictive model becomes the 
one corresponding to the total partition 1234. 

Total modularity 

So far, our measure of modularity has been parameterized by 
N, the amount of training data. Here, we derive a parameter- 
free measure of the total modularity in a dynamical system. 

In our definition of risk (eq. IS), we used the posterior 
predictive distribution (X'lX, X' 1JV , X 1 - JV ), the prob- 
ability assigned to the next data sample by a model trained 
on TV previous data samples. Given our assumptions, the fol- 
lowing relationship holds between the prior predictive dis- 
tribution, the probability an untrained model assigns to N 
data samples, and the posterior predictive distribution: 



7V-1 



n..N\ 



X" 



X n+l X / 



n=0 



This suggests the prequential interpretation of Bayesian 
prediction ( |Dawid| [T992): the expected predictive error of a 
model on N samples is the sum of the expected predictive 
errors on each successive sample after training on the pre- 
vious samples. This accumulated prediction error is termed 
cumulative risk (Haussler and Opper 1997| l: 



Rn.i 



N-l 
n=0 



The risk of eq. |6j can be seen as the rate of change of the 
cumulative risk as the amount of training data grows. 

Total modularity is the total gain in predictive accuracy 
(i.e., decrease in cumulative risk) provided by the optimally 
predictive models Q*(N) versus the unfactorized, total- 



partition model Qng. Let R 



N,Q* 



= E 



JV-l 



=0 ' n,Q*(n) 



be 



the cumulative risk of an agent who selects the risk-minimal 
model at each N. The total modularity is then: 



T = lim (Rn.q^, - R 



N 



N,Q* 



(9) 



Total modularity measures the overall predictive advan- 
tage gained by using factorized models, and is not a function 
of a particular N. High values of total modularity indicate 



4 In general, the approximation of eq. [v]is only accurate for large 
N. However, it suffices for our explanatory purposes. 




Figure 3: Total modularity of two binary variables which 
copy each others' state with probability p and maintain their 
own state with probability 1 — p. Total modularity increases 
as coupling decreases, and diverges as p — >• 0. 



that simpler models have significantly improved predictive 
performance during earlier stages of the learning process]^] 
To use the previous example, the cumulative risk of the mod- 
els plotted at the top of fig. [2]is shown at the bottom of that 
figure. The total modularity of the dynamic network shown 
in fig. [T] is equal to the asymptotic difference between the 
cumulative risks of Q1234 (= Qtt ) and Q*. Equivalently, it 
is also the total area between the lines corresponding to the 
(non-cumulative) risks of Q1234 and Q*. 

For another illustration of total modularity, we consider a 
simple dynamical system composed of two binary variables. 
Each variable is parameterized in the following manner: at 
each time step, with probability p it assumes the value of 
the other variable in the previous time step, and with prob- 
ability 1 — p it maintains its own value from the previous 
time step. The amount of dynamic coupling between the 
two nodes increases with p: at p = the variables have no 
interaction, while at p — 1 their values are completely cor- 
related (with a one timestep lag). This dynamic coupling is 
illustrated in fig. [3] which plots the total modularity of this 
system against the coupling parameter p. The total modu- 
larity monotonically decreases as p increases, showing that 
greater coupling leads to lower total modularity. As p — > 0, 
the two variables become completely independent and total 
modularity diverges (in this case, it grows without bound at 
a rate proportional to log N). 

State-dependent and causal modularity 

The way information flows within a dynamical system can 
depend on the system's state. For example, a partition's 
stochastic interaction can be different in different attractors. 
We can quantify this by different choices of the starting 

5 Minimization of accumulated error by online switching from 
simpler to more complex models is related to a learning framework 
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Figure 4: Risk for two systems, each having two binary vari- 
ables: in system A (left column) each variable copies previ- 
ous value of the other, in system B (right column) each vari- 
able takes opposite of its own previous state, a) and d): Risk 
under uniform starting state distribution. Lowest risk model 
of A becomes the total one, while factorized model remains 
optimal for B. b) and e): Risk and optimal decompositions 
depend on the starting state distribution. Computed over 
P(X=(0,1))=0.5,P(X=(1,0))=0.5, risk and optimal 
decompositions become the same for A and B, though their 
causal organization is different, c) and f): Causal risk leads 
to different decompositions of A and B, even when com- 
puted over same starting state distribution as in b) and e). 



state distribution, -P(X). Though we have generally taken 
P(X) to be a fully-supported uniform distribution, it can be 
weighted preferentially over some subset of starting states. 

For example, consider two systems, each composed of 
two binary variables. In system A, each variable copies the 
previous value of the other, while in system B, each variable 
takes the opposite of its own previous state. Fig. |4] shows 
the risk plots for both A (left column) and B (right col- 
umn), where and|4}l are calculated for a uniform starting 
state distribution. The risk, as well as the optimal decom- 
positions, is different between the two systems: A (which 
performs the copy operation) eventually chooses the total 
partition {{1,2}} as the most predictive, while B (whose 
variables perform independent state flips) never does. 

If, however, a non-uniform starting state distribution 
is chosen, risk and optimal decompositions can change. 
The risk for starting state distribution P(X=(0, 1)) = 
0.5,P(X=(1,0)) = 0.5 are shown in fig. andg (for 
systems A and B respectively). Different parts of the start- 



ing state space induce different risk values and optimal de- 
compositions: for this distribution, fig. [4j) shows that the 
total partition {{1. 2}} is never chosen as the optimally pre- 
dictive one for system A. 

Additionally, for these starting states the transition distri- 
butions of A and B are identical: if either system is started 
in state (0, 1), it deterministic ally transitions to state (1,0), 
and similarly for the transition from (1, 0) to (0, 1). Because 
the observed dynamics of the two systems are identical, the 
risk functions and optimal decompositions are also equal. 
Though systems A and B are defined using different causal 
architectures, here their modular organizations are indistin- 
guishable. Specifically, A is postulated to have a causal con- 
nection among its variables but - for this starting state dis- 
tribution - they display no stochastic interaction. 

This example highlights the difference between statisti- 
cal correlation and causal interaction. To properly handle 
the latter, we utilize a notion of causality based on seman- 
tics of intervention (Pearl] |2000| l, recently developed in an 
information-theoretic direction by |Ay and Polani| ( |2008[ ). In 
Pearl's treatment, conditional probability distributions rep- 
resent not only correlations, but also responses of variables 
to externally-imposed interventions. This is especially natu- 
ral when dynamics of interest are generated by causal mod- 
els, such as dynamic causal Bayesian or Boolean network 
models frequently used in artificial life and systems biology. 

In our example, the functional organization of systems A 
and B can be differentiated - even within the non-uniform 
starting state distribution mentioned above - if the starting 
states of the systems can be intervened upon. This is because 
in system A - but not system B - changing the starting state 
of one variable can change the other variable's future state. 

We consider interventions formally by noting that the risk 
r N,Q„ of eq. [6]need not take the same starting state distribu- 
tion for training data as for the testing data. Instead, we take 
the starting state distribution for training data to be drawn 
i.i.d. from a fully-supported and uniform distribution P(X) 
(the distribution of interventions), while the testing starting 
states can be drawn from any P(X) of interest. We refer to 
risk evaluated under this learning scenario as causal risk: 

?N,Q« = ]TP(x)P(x' |x)[ log P(x'|x)- 

x,x' 

^P(x 1 ^)P(x' 1 - w |x 1 - A ')logQ T (x'|x,x' 1 - N x 1 - JV ) 
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As N — > oo, the posterior predictive distribution of model 
Qtt approaches \{ B ^ P(X^|X B ), where P(X' S |X B ) 
is the whole-system transition distribution P(X'|X) 
marginalized onto variables in block B using P(X). Then, 
?n,q„ can be approximated by: 



where d^, and the expectations in the KL terms use the 
testing starting state distribution. The KL divergence be- 
tween P(X^ |Xs) (the whole-system transition distribution 
marginalized onto variables in block B using P(X) ) and 
P(X' b |Xb) reflects the amount of extra perturbation that 
active interventions inject into block dynamics. The two 
distributions need not be equal, unless P(X) = P(X) or 
the partition under consideration is the total one. Because 
KL divergence is non-negative, causal risk rN,Q„ is not less 
than the statistical risk r^r g^ (compare above to eq. [7J. 

Fig. |4j; and show the causal risk for systems A and 
B (respectively) with P(X= (0, 1)) = 0.5, P(X= (1,0)) = 
0.5. In|4j; - but not^- the total partition model assumes a 
lower risk than the factorized model, indicating that for the 
starting states in question, system A - but not system B - 
has causal interactions between its variables. 

Conclusion 

Modularity is normally treated as an objective property of a 
system's organization. Our approach instead considers from 
the perspective of modeling and prediction. In the context 
of inferring dynamic models from limited data, modularity 
allows for models that are predictive but simple, with the 
amount of training data controlling the trade-off. Our sta- 
tistical treatment connects to previous information-theoretic 
approaches, but goes further by providing principled terms 
for identifying small modules. 

Our approach can also be used to find state-dependent 
modular organizations, both in statistical and causal (inter- 
ventional) senses: models trained on interventional dynam- 
ics but tested on arbitrary distributions give rise to a mea- 
sure that identifies causal modules. This is related to ex- 
isting information-theoretic measures of causal interactions 



T T (X'|X)+]TKL P(X' b |X b ) 
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between subsystems (Tononi and Sporns 2003), but here 
emerges naturally from the framework of statistical model- 
ing. This framework also produces a measure of total mod- 
ularity present in the system, which quantifies the overall 
predictive advantage that modularity provides through the 
entire model inference process. 

As a side note, if the learning of real-world cognitive sys- 
tems (such as scientists or organisms) proceeds in a man- 
ner somewhat similar to the statistical framework presented 
here, our approach suggests why such systems may infer 
modular organizations in the external world: under condi- 
tions of limited data, this assumption can simplify learning 
and lead to gains in predictive power. 

One important issue with our treatment is its model- 
dependence. The complexity penalization term of eq. ^de- 
pends on the model class, and different model classes may 
have different parameterizations and functional forms. Our 
examples employed products of Markov chain models, a 
rather general dynamic model class but one heavily parame- 
terized; others could be used. The choice of model class can 
be thought of as a null model of system dynamics. 



Several generalizations suggest themselves. For example, 
it is possible to infer module timescales by searching not 
only over decompositions, but also model orders (numbers 
of previous states on which transition probabilities depend; 
for inferring Markov chain order, see Strelioff et al. 2007) 1. 
Fuzzy modular organizations, in which a variable can be- 
long to more than one module, can be accommodated by 
allowing partially-overlapping blocks. More generally, the 
model search space could include other structures besides 
partitions (e.g. trees or networks) to impose independence 
constraints on information flow between blocks. 

Identifying modularity in dynamical systems is important 
in complex systems research in general, and biological sys- 
tems modeling in particular. Our method differs from recent 
community-detection methods that find modularity in static 
graphs, in that it focuses on the organization of interactions 
between dynamic system components. In future work, we 
hope to apply it to the analysis of regulatory and signal- 
ing control in biochemical networks, as well as inference 
of functional neural organization from brain recordings. 
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